Research on Intelligent Control, supported by the NASA Lewis 
Research Center and the U.S. Army, has been conducted by the 
Department of Systems Engineering at Case Western Reserve 
University. This work began in 1987 with an initial research 
contract to support a literature survey and problem formulation for 
the concept of intelligent control. Several questions were asked in 
the earlier work in an attempt to focus on concepts and ideas that 
would be relevant for future research studies. During the initial 
period of this work, we prepared a detailed report on the methods 
and techniques from systems and control theory, hierarchical and 
multilevel systems theory, expert systems and AI, and learning 
systems and automata theory that were relevant to the general area 
of intelligent control. A study was also conducted to determine the 
performance of humans in control tasks. 

The experiment was based on a computer simulation of the 
classical pole balancing problem, where a concentrated mass is 
located at the end of an inverted rod attached to a cart which can 
move on a horizontal surface. The simulation included various 
methods of representing the system data to the operator. For 
example, in one simulation the pole and cart system was graphically 
displayed and the operator could view the time evolution of the 
system as forces were applied to the cart. In another operating 
mode, a bouncing ball whose frequency was proportional to the 
velocity of the pendulum was displayed. The operator had to discover 
by trial and error which direction of the ball's motion was 
associated with clockwise and counterclockwise motion of the 
pendulum; a failure condition was always signaled to the operator 
when the pendulum fell to the horizontal position. Disks containing 
the simulation were distributed to different people, some technical 
and some nontechnical, to determine the ability of these individuals 
to "learn" an appropriate control law. The level of force and the 
initial angular perturbation of the pendulum were randomized over 
the simulation runs. All the control moves of the participants were 
recorded and analyzed at a later date. 

The conclusions from the experiment were not surprising and 
formed the basis for the technique of "learning" or "intelligent" 
control that we adopted for the first year of our research work: a 
reinforcement learning approach. In a reinforcement learning 
approach to control, there is a discrete set of control alternatives 
and a performance functional which is used to evaluate the 
effectiveness of the control inputs. The state space of the dynamical 
system that is to be controlled is quantized into a collection of sets 
called situations, and the objective of the reinforcement learning 
controller is to assign to each quantized set a "unique" control value 
which is preferred on the basis of the performance functional. The 
reinforcement learning method uses a reward and penalty scheme to 
adjust a set of probabilities, with one probability associated with 
each control input/situation pair. 
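
The structure described above can be sketched in a few lines of 
Python. This is an illustration only, not the implementation used in 
this work: the bin edges, the three force levels, and the names 
situation_index and SituationTable are assumptions chosen for the 
example. 

```python
import bisect
import random

# Hypothetical quantization of a two-variable state (angle, angular rate).
ANGLE_EDGES = [-0.1, 0.0, 0.1]   # rad; 4 angle bins (assumed values)
RATE_EDGES = [-0.5, 0.5]         # rad/s; 3 rate bins (assumed values)
CONTROLS = [-10.0, 0.0, 10.0]    # discrete control alternatives (assumed)


def situation_index(angle, rate):
    """Map a continuous state to the number of the quantized set containing it."""
    i = bisect.bisect_right(ANGLE_EDGES, angle)
    j = bisect.bisect_right(RATE_EDGES, rate)
    return i * (len(RATE_EDGES) + 1) + j


class SituationTable:
    """One probability per control input/situation pair, initially uniform."""

    def __init__(self, n_situations, n_controls):
        self.p = [[1.0 / n_controls] * n_controls for _ in range(n_situations)]

    def sample_control(self, s, rng=random):
        """Draw a control index according to the probabilities for situation s."""
        return rng.choices(range(len(self.p[s])), weights=self.p[s])[0]

    def preferred_control(self, s):
        """Index of the control currently preferred for situation s."""
        row = self.p[s]
        return max(range(len(row)), key=row.__getitem__)


N_SITUATIONS = (len(ANGLE_EDGES) + 1) * (len(RATE_EDGES) + 1)
table = SituationTable(N_SITUATIONS, len(CONTROLS))
s = situation_index(0.05, -0.2)          # situation for one observed state
u = CONTROLS[table.sample_control(s)]    # control drawn for that situation
```

As learning proceeds, the row of probabilities for each situation 
is expected to concentrate on a single control, which is the "unique" 
control value referred to above. 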

The reinforcement learning controller is implemented at the 
direct control level of the intelligent control hierarchy. In the 
problem formulation developed in our first year effort, we proposed 
an intelligent control hierarchy which utilized a functional 
decomposition of the overall control problem. This decomposition 
included: (1) a direct control level that is responsible for 
responding in real time to disturbances in the plant; (2) a 
planning/optimizing level controller that is responsible for 
modifying set points, parameters and performance goals for the 
direct control level in response to changes in the plant or 
operating environment; and (3) a supervisory/explanation facility 
and user interface to the intelligent control system, which provides 
the operator with qualitative information and explanations about the 
process and the performance of the intelligent control system. The 
operator (control system user) can use the information supplied by 
the explanation facility to modify process knowledge and goals. This 
functional decomposition is commensurate with a temporal 
decomposition of the control tasks: as the complexity of the 
decision/control problem increases, along with the computational 
time required to determine the appropriate control action or 
decision, the control task is relegated to higher levels of the 
intelligent control hierarchy. 

The major effort for the first phase of the research work was 
in the implementation and evaluation of the direct level controller. 
The direct level controller incorporates six subsystems for 
learning/control selection. The critic is the evaluation subsystem 
in the direct level controller. This subsystem accepts output data 
from the process and the control database and provides a 
reinforcement (reward)/punishment signal to the learning 
subsystem. An important problem that was addressed in the design 
of the critic was the credit assignment problem. As we are dealing 
with a dynamical system, there is a functional relationship between 
the process inputs and outputs. Hence, the critic must know how to 
assign credit or blame to past and current controls based on the 
current value of the process output. We developed specific 
techniques to deal with this complexity; the details can be found 
in [1] or [2]. The learning system uses reinforcement data from the 
critic to adjust the (conditional) probabilities of the 
situation/control pairs; a linear-reward-penalty scheme is used in 
the implementation. The learning system computes an update of the 
situation/control probabilities and provides this information to the 
control database subsystem. Before a control for the current time 
period can be computed, the process output must be analyzed to 
determine the "state" of the system. This involves three subsystems 
of the direct control level: the data monitor, the situation 
recognition unit and the control selection unit. The data monitor 
analyzes the process output to detect anomalous conditions, 
such as sensor malfunctions, which would affect the quality of the 
data and the performance of the direct level controller. If the data 
monitor passes the output data, the data are classified into 
situations in the situation recognition unit. The output space of the process is 
quantized into sets referred to as situations, and the situation 
recognition unit assigns a situation number to the observed process 
output. 
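
For readers unfamiliar with linear reward-penalty updating, the 
following Python sketch shows the standard textbook form from the 
learning automata literature. The exact update and step sizes used in 
our implementation are given in [1] and [2] and are not reproduced 
here; the function name and the values of a and b are assumptions. 

```python
def linear_reward_penalty(probs, chosen, rewarded, a=0.1, b=0.05):
    """Return updated control probabilities for one situation.

    probs    -- current probabilities over the discrete controls (sum to 1)
    chosen   -- index of the control that was applied
    rewarded -- True if the critic issued a reward, False for a penalty
    a, b     -- reward and penalty step sizes in (0, 1); assumed values
    """
    r = len(probs)  # number of control alternatives, must be at least 2
    updated = []
    for j, p in enumerate(probs):
        if rewarded:
            # Move probability mass toward the chosen control.
            new_p = p + a * (1.0 - p) if j == chosen else (1.0 - a) * p
        else:
            # Move mass away from it, spread evenly over the other controls.
            new_p = (1.0 - b) * p if j == chosen else b / (r - 1) + (1.0 - b) * p
        updated.append(new_p)
    return updated


row = [1 / 3, 1 / 3, 1 / 3]
row = linear_reward_penalty(row, chosen=2, rewarded=True)    # reward control 2
row = linear_reward_penalty(row, chosen=0, rewarded=False)   # penalize control 0
```

Both branches preserve the total probability, so each row remains a 
valid distribution over the control alternatives after every update. 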

Remark: Quantizing the process output into situations can be 
difficult and can induce complicated behavior in the controlled 
system. The problem is that the direct level controller is attempting 
to assign a unique control value to each situation. However, the 
dynamics of the quantized system can be quite complicated and, in 
fact, in some instances the evolution of the quantized system is 
random [3] and [4]. In such cases, the learning unit is unstable in the 
sense that controls which are rewarded for a particular situation at 
one time are penalized for the same situation at another time. This 
is a direct result of the fact that the output quantization for the 
process does not necessarily define a Markov partition for the 
system's output flow. 

The direct level controller operates in a closed-loop 
configuration. In such cases, it is well known that identification 
(learning) and control can compete; this is referred to as the dual 
control effect. The problem stems from the fact that if the 
controller is doing a good job regulating the plant, then presumably 
the output of the process remains in a neighborhood of the desired 
set point or trajectory, and the input/output data that are collected 
are not very informative about the general characteristics of the 
process dynamics, so identification is difficult. This so-called dual 
effect is also a problem in the direct controller, where learning and 
control occur simultaneously. 

The direct level controller was implemented on a Texas 
Instruments Explorer system and tested in simulation for the 
control of an inverted pendulum. This particular problem was chosen 
because it has been used in past experimental (simulation) studies 
to evaluate different methods of intelligent or learning control. 
Although the direct level controller showed reasonable performance 
in learning a stabilizing controller for the inverted pendulum in a 
variety of different operating configurations, on many occasions the 
learning times required by the controller were prohibitively large 
and the control probabilities would exhibit oscillatory behavior. A 
detailed analysis of the phenomena led to the conclusion that it was 
the quantization of the output space into situations and the dual 
effect of the combined learning/controller synthesis that were the 
root causes. These problems were addressed in detail in the second 
year research work. 

The second phase of the research effort concentrated on 
developing a refined implementation of the direct level controller, 
including an adaptive/optimizing level function for the learning 
phases of the controller. As mentioned previously, the direct level 
controller uses a reinforcement learning control paradigm to 
synthesize the control action. The control actions are rewarded if 
they improve the dynamical behavior of the system as measured by a 
performance functional termed the subgoal, and punished otherwise. 
The problem of determining an appropriate subgoal for the 
instantaneous evaluation of the performance of the system, derived 
from the overall performance functional for the process being 
controlled, is system dependent and, in general, unsolved. In 
this work we have used a heuristic approach to construct a subgoal 
for the problem of stabilizing the inverted pendulum. No general 
results for arbitrary systems have been determined. 

The direct level controller operates as developed in the phase 
one research effort. The adaptive/optimizing level is developed to 
improve the operation of the direct level controller by adjusting the 
information classifying scheme. The reinforcement learning control 
scheme decomposes the control action synthesis task into: (1) 
classifying the input/output data of the process into situations, and 
(2) determining the control action which maximizes the a posteriori 
probability of being the correct control action for the situation 
identified. The controller has two objectives: to learn as much as 
possible about the plant, and to synthesize the best control policy 
as measured by the performance functional for the plant. As in a 
classical adaptive control scheme, the learning and control 
objectives are usually competing. These two objectives are used to 
distinguish between two distinct phases of the learning process. 

During learning, classification of measured data from the plant 
into situations is based on neighborhoods defined in the input/output 
space of the process. The neighborhoods are induced by a similarity 
metric and the learning process is decomposed into two phases: the 
creation phase and the refinement phase. In the creation phase, 
controls are applied randomly to the process in an attempt to 
stimulate all modes of the system and enhance 
identification/learning at the possible expense of control 
performance. In the refinement phase of the learning process, 
control actions are determined by their expected success in terms of 
a subgoal objective, and the topology of the neighborhoods is 
altered in an attempt to find a partition of the output space of the 
process such that a unique control is associated with each 
input/output situation pair. 

A unique feature of the work is the introduction of the concept 
of entropy as a means of guiding and evaluating the determination of 
the neighborhoods during the creation phase and the refinement 
phase. The creation phase is identified by the entropy (or 
uncertainty) of each neighborhood being greater than a given, user- 
specified, threshold. Once the entropy has been reduced to less than 
the threshold, the learning switches from the creation phase to the 
refinement phase. For more details the reader is referred to [5]. 
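
As a simple illustration of the switching rule, the Python sketch 
below uses the Shannon entropy of the control probabilities attached 
to one neighborhood. The precise entropy measure is defined in [5]; 
the choice of Shannon entropy in bits, the function names, and the 
threshold value are assumptions made for this example. 

```python
import math


def neighborhood_entropy(probs):
    """Shannon entropy (in bits) of the control probabilities of one neighborhood."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def learning_phase(probs, threshold):
    """'creation' while the entropy exceeds the threshold, else 'refinement'."""
    return "creation" if neighborhood_entropy(probs) > threshold else "refinement"


THRESHOLD = 1.0  # user-specified threshold in bits (assumed value)
early = learning_phase([1 / 3, 1 / 3, 1 / 3], THRESHOLD)   # little is known yet
late = learning_phase([0.9, 0.05, 0.05], THRESHOLD)        # one control dominates
```

A uniform row has the largest possible entropy, so a newly created 
neighborhood starts in the creation phase; the entropy falls as the 
probabilities concentrate on one control. 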

The adaptive/optimizing level intervenes during phase two of the 
learning process based on observed anomalies in the direct level 
controller. The anomalies are either events which cause a particular 
control action to increase the entropy associated with a particular 
situation or a partitioning of the output space. The underlying 
concept of intervention is that by altering the topology of the 
neighborhoods (that is, the partition of the output space of the 
process), the behavior of the learning control scheme can be 
improved. Although it is not always true, smaller neighborhoods 
usually improve the controller performance at the cost of additional 
computational complexity. One possible intervention strategy is to 
adapt the threshold of the similarity metric which defines the 
neighborhoods. Adjusting the threshold treats all directions in the 
output space uniformly, and such a scheme can result in deterioration 
of the overall performance of the controller. Therefore, we have 
chosen to use a parametric adjustment of the weighting matrix in the 
quadratic similarity metric which is used to classify input/output 
patterns into situations. A gradient based algorithm is derived to 
provide adjustments to the similarity metric. Refer to [5] for 
details. 
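
The difference between adjusting the threshold and adjusting the 
weighting matrix can be seen in the following Python sketch. It shows 
only the quadratic metric and the classification step; the gradient 
based adjustment itself is derived in [5] and is not reproduced. The 
neighborhood centers, the matrices, and the function names are 
assumptions made for the example. 

```python
def quadratic_distance(x, center, W):
    """Weighted quadratic form (x - c)^T W (x - c) for a symmetric matrix W."""
    d = [xi - ci for xi, ci in zip(x, center)]
    n = len(d)
    return sum(d[i] * W[i][j] * d[j] for i in range(n) for j in range(n))


def classify(x, centers, W, threshold):
    """Index of the nearest neighborhood center, or None if none is within threshold."""
    best = min(range(len(centers)),
               key=lambda k: quadratic_distance(x, centers[k], W))
    if quadratic_distance(x, centers[best], W) <= threshold:
        return best
    return None


centers = [[0.0, 0.0], [1.0, 0.0]]         # hypothetical neighborhood centers
x = [0.4, 0.3]                             # one observed input/output pattern
identity = [[1.0, 0.0], [0.0, 1.0]]
reweighted = [[1.0, 0.0], [0.0, 0.1]]      # de-emphasize the second coordinate
uniform = classify(x, centers, identity, threshold=0.2)
weighted = classify(x, centers, reweighted, threshold=0.2)
```

With the identity weighting the pattern lies outside every 
neighborhood, while with the reweighted matrix it falls into the 
first one. Changing the threshold alone would grow or shrink the 
neighborhoods equally in all directions; changing the weighting 
matrix stretches them along selected directions only. 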

The final accomplishments of the work in phase two were 
refinements and enhancements to the implementation of the 
intelligent controller on the TI Explorer computer system. The 
windows environment and graphics capability of the Explorer system 
were exploited to develop a user interface. This user interface, 
together with the incorporation of animation, makes the system a 
suitable platform for development work in intelligent control. 

The third and final phase of the intelligent control system 
research effort was aimed at relaxing some of the restrictions of 
the reinforcement/learning control paradigm which formed the basis 
of the direct level controller. Two approaches were taken during this 
work, both incorporating the use of a priori information into the 
realization of the intelligent control system. The first approach was 
to consider an alternative learning control method based on 
feedforward neural networks for a special class of nonlinear 
dynamical systems: the class of linear-analytic systems. The other 
approach was to use a priori system information to develop methods 
that would extend the capabilities of the reinforcement/learning 
control approach. We mentioned earlier the problem which results 
from quantization of the output space of the process; the other 
problem is the quantization of the input or control space. This issue 
was studied as part of the third phase of the research work. 

Linear-analytic systems are a general class of nonlinear 
systems where the control input enters linearly into the system 
dynamics and the vector field which defines the system flow when 
the input is fixed is made up of analytic functions of the state of the 
system. This class of systems is important for at least two reasons: 
(1) many nonlinear systems can be represented by, or approximated 
by, dynamical systems of this form, and (2) for this class of 
systems it is possible to develop a theory of control system 
synthesis which closely resembles the well known linear theory. Our 
approach was to utilize the fact that for systems of the linear- 
analytic type, there exists a theory of control synthesis in which a 
feedback control is derived which linearizes the input/output 
dynamical behavior of the system. If we could develop an intelligent 
control structure that would learn the linearizing feedback 
controller, then classical linear control methods could be used on 
the linearized system to obtain the desired closed-loop system 
performance. The realization of the intelligent controller chosen for 
this part of the work was in terms of a feedforward neural network, 
where unsupervised learning methods were developed for this 
application to guide the selection of an appropriate linearizing 
feedback control input. 
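
For reference, the standard single-input, single-output form of the 
linearizing feedback can be written as below. The symbols f, g, h, 
the Lie derivatives L_f h and L_g h, and the new input v follow the 
usual textbook notation and are assumed here; only the simplest case 
(relative degree one) is shown. 

```latex
\begin{align}
  \dot{x} &= f(x) + g(x)\,u, \qquad y = h(x), \\
  \dot{y} &= L_f h(x) + L_g h(x)\,u, \\
  u &= \frac{v - L_f h(x)}{L_g h(x)}, \quad L_g h(x) \neq 0
     \;\;\Longrightarrow\;\; \dot{y} = v .
\end{align}
```

Once the input/output behavior has been reduced to the linear 
relation between v and y, the new input v can be designed with 
classical linear methods. 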

We began with the assumption that the linear-analytic system 
was feedback linearizable and then used this information to select 
the appropriate form of a linear system which was used during 
training. This idea is similar to a model-reference adaptive control 
scheme, except in our implementation a feedforward neural network 
was used as the controller and a gradient based algorithm (an 
extension of the familiar back propagation algorithm for a 
feedforward neural network) was used to adjust the network 
parameters using real-time input/output data from the system. For 
more details on the theory and applications of this work the reader 
is referred to [7] and [8]. 

The alternative approach we investigated for incorporating a 
priori system information into the synthesis of learning control 
strategies was to focus attention on two dimensional systems and 
their geometric properties. As we mentioned earlier, one problem 
with reinforcement/learning schemes is related to partitions of the 
output or state space of the process to be controlled. For control 
problems related to set-point regulation, including stabilization, the 
existence of a suitable control which transfers an initial point to 
the desired final point is determined by the attainability and 
reachability properties of the system. Therefore, the ability of the 
learning control system to determine a suitable control action for a 
particular point-to-point steering control problem also depends on 
these geometric properties of the system. In this work we have used 
methods of characterizing the attainable and reachable sets for a 
dynamical system to enhance the performance of a learning control 
system. The attainable and reachable sets are parameterized by the 
control input which is assumed to be held constant over a fixed time 
interval, referred to as the control time. This is consistent with a 
digital (discrete time) implementation of the controller where, for 
example, a zero order hold would be used as a reconstruction device. 
For a single input system, we assume the control takes values in a 
compact convex subset (an interval) of the set of real numbers (ℝ) 
and the control set includes the origin. Given an initial point p, we 
define the attainable set from p, A(p), on the interval [0,T] to be 
the collection of trajectories of the controlled system starting from 
p as the control input ranges over the set of admissible 
control inputs. Similarly, given a target (final) point p, we define 
the reachable set R(p) on the interval [0,T] to be the collection of 
trajectories of the controlled system which can be steered to p as 
the control input ranges over the set of admissible control inputs. 
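
The two definitions can be illustrated numerically with the Python 
sketch below. It is not taken from [9]: the double integrator 
(x1' = x2, x2' = u) is chosen only because its response to a constant 
control has a closed form, the function names are hypothetical, and 
only the endpoints at time T are sampled rather than the full 
trajectories on [0,T]. 

```python
def step_double_integrator(p, u, T):
    """State after holding control u constant for time T (x1' = x2, x2' = u)."""
    x1, x2 = p
    return (x1 + x2 * T + 0.5 * u * T * T, x2 + u * T)


def control_grid(u_min, u_max, n):
    """n equally spaced constant controls covering the interval [u_min, u_max]."""
    return [u_min + k * (u_max - u_min) / (n - 1) for k in range(n)]


def sample_attainable(p, u_min, u_max, T, n=11):
    """Pairs (u, endpoint): points reached from p at time T under constant u."""
    return [(u, step_double_integrator(p, u, T))
            for u in control_grid(u_min, u_max, n)]


def sample_reachable(p, u_min, u_max, T, n=11):
    """Pairs (u, start): points steered to p at time T under constant u."""
    x1, x2 = p
    starts = []
    for u in control_grid(u_min, u_max, n):
        x2_start = x2 - u * T                               # invert x2 update
        x1_start = x1 - x2_start * T - 0.5 * u * T * T      # invert x1 update
        starts.append((u, (x1_start, x2_start)))
    return starts


attainable = sample_attainable((0.0, 0.0), -1.0, 1.0, T=1.0)
reachable = sample_reachable((0.0, 0.0), -1.0, 1.0, T=1.0)
```

The first and last entries of each list correspond to the extreme 
values of the control interval. 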

The attainable and reachable sets play important roles in problems 
related to point-to-point steering in control systems. The geometry 
of the sets A(p) and R(p) depends on the characteristics of the 
system and the set of admissible controls. For more details refer to 
the thesis [9] and the papers [10] and [11]. 

The problem of intelligent control as formulated in this work 
is to learn an appropriate feedback control strategy which will steer 
a given set of initial points to a given final point on a time interval 
[0,T]. Of the difficulties we encountered with 
reinforcement/learning control in our previous years' work, the 
discretization of the control set and its influence on the dynamical 
system performance was a focus of this research effort in the final 
year of the project. An important issue is that in order to have "fine" 
control of the system the number of partitions of the control set 
(i.e., the number of control values) must be large, but this causes 
computational and numerical problems in the reinforcement/learning 
algorithms. Using a priori information about the system (the 
geometry of the attainable, reachable and admissible control sets), 
we developed an adaptive form of the reinforcement/learning 
control suitable for a broad class of nonlinear two dimensional 
systems. The basic idea behind the method is to use the convexity 
property of controllable sets S in the phase space of a two 
dimensional nonlinear system. In this set S, all points are attainable 
and reachable with respect to all other points in the set and the 
boundary of the set S is determined by extremal trajectories of the 
controlled system. That is, for the case of a single input system, if 
the control set is the interval [a,b], then the extremal trajectories 
are determined by choosing the control to be equal to a or b, 
respectively. In planning a trajectory from an initial point p to a 
target point t, we select a path in the phase space which consists of 
a collection of attainable and reachable sets which have pairwise 
nonempty intersections. If there is no such path, then the point-to- 
point steering problem has no solution. Then, a collection of 
extremal trajectories forms a boundary for this region and the 
learning algorithm attempts to synthesize an appropriate control 
sequence which will accomplish the desired transfer. Using 
convexity properties of the controllable sets, the algorithm iterates 
on the partition of the control set to continuously refine the 
partition while keeping the number of elements in the partition 
constant. In this way, we have developed an adaptive 
reinforcement/learning scheme which has essentially a continuum 
of control values. There is a coarse partition of the control set 
which includes the extremal controls and, at each iteration, based on 
input/output data from the system and the geometric properties of 
the reachable and attainable sets, the control set partition is 
refined and the learning is continued. More details and a simulation 
study can be found in the thesis [9]. 

This research program has been very productive: a number of 
important issues in intelligent control have been identified, and a 
number of important contributions to the theory and application of 
intelligent control methods have been made. Significant 
contributions include: 

1. The development of a hierarchical framework for intelligent 
control [1] and [2]; 

2. The development, implementation and testing of a direct 
level controller based on a reinforcement/learning control 
paradigm [1] and [2]; 

3. The development of an information-theoretic framework for 
adaptive learning to address the difficult "dual" effects of the 
reinforcement/learning controller [5] and [6]; 

4. The implementation of an adaptive/optimizing control layer 
within the intelligent control hierarchy for improving learning 
and control performance [5] and [6]; 

5. The development and implementation of a software based 
simulation and graphically based evaluation tool for use as an 
intelligent control system development tool [1] and [5]; 

6. The study of quantization effects on the dynamic system 
trajectories, including entropy measures and active probing, 
chaos and complicated dynamical system behavior [3], [4], [5] 
and [6]; 

7. The application of neural networks for direct level control, 
the synthesis of feedback linearizing direct level controllers 
for linear-analytic systems using learning control methods [7] 
and [8]; 

8. The use of a priori system information in learning control 
system synthesis, reachable and attainable sets and an 
adaptive scheme for refining control set partitions to improve 
closed loop control system performance [9]. 